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We propose an adaptive independent Metropolis-Hastings algo- 
rithm with the ability to learn from all previous proposals in the 
chain except the current location. It is an extension of the indepen- 
dent Metropolis-Hastings algorithm. Convergence is proved provided 
a strong Doeblin condition is satisfied, which essentially requires that 
all the proposal functions have uniformly heavier tails than the sta- 
tionary distribution. The proof also holds if proposals depending on 
the current state are used intermittently, provided the information 
from these iterations is not used for adaption. The algorithm gives 
samples from the exact distribution within a finite number of itera- 
tions with probability arbitrarily close to 1. The algorithm is partic- 
ularly useful when a large number of samples from the same distribu- 
tion is necessary, like in Bayesian estimation, and in CPU intensive 
applications like, for example, in inverse problems and optimization. 

1. Introduction. Assume we want to sample from a distribution tt or 
make estimates based on tt, but direct samples from tt are not obtainable. 
Rejection sampling importance sampling, and sampling importance resam- 
pling (SIR) are techniques for generating such samples and estimates by 
proposing samples from a different distribution q. Another alternative is 
to use a Metropolis-Hastings algorithm with a proposal function that is 
independent of the present position. This approach, which we call indepen- 
dent Metropolis-Hastings, is also known as independent Markov chain in 
Tierney (1994), or Metropolized independent sampling in Liu (1996), or 
independence sampler in Roberts and Rosenthal (1998). The efficiency of 
these methods depends on the proposal distribution q being close to tt. If 
this is not practically possible, other Markov chains such as Metropolis- 
Hastings based on local moves around the present state may be better; see 
Meyn and Tweedie (1993), Gilks et al. (1996) and Geyer (1992). 
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In all the alternatives described above, knowledge about the stationary 
distribution is gained as the number of iterations increases. This knowledge 
may be used to adapt the proposal distribution in order to improve the 
convergence of the chain. The bound on the convergence improves and the 
bound on the correlation between subsequent states in the chain decreases 
when the proposal distribution better approximates the stationary distribu- 
tion; see the proposition in Holden (1998b). 

Adaptive Markov chains is an active research area; see, for example; 
Atchade and Rosenthal (2003) and the references therein. The paper 
Roberts and Rosenthal (2005) has several general results for adaptive Markov 
chains, but they focus on diminishing adaption that is not necessarily satis- 
fied by the method proposed in this paper. Erland (2003) gives an overview 
of adaptive algorithms. He divides adaptive algorithms into three groups: 
adaptive strategies within ordinary MCMC, algorithms where the adap- 
tion diminishes and algorithms with regeneration. There are several papers 
proposing very different algorithms within each group. The possibilities for 
adaption within ordinary MCMC is limited; see Tjelmeland and Hegstad 
(2001). Regeneration tends to happen so seldom in higher dimension that 
this limits the applicability of these algorithms; see Gilks et al. (1998). Al- 
gorithms where the adaption diminishes are probably the most promising; 
see for example, Haario et al. (2001). However, also these algorithms have 
restrictions that make them difficult to use. 

The algorithm presented in this paper is within the diminishing subgroup 
even though the adaption does not need to diminish. It is less technical than 
the other algorithms, and it is easy to describe and use in practice. Also, it 
has a proposal function that depends on all earlier proposed states except 
the current, and a Metropolis-Hastings- like acceptance step. There is a large 
flexibility in the choice of proposal function and it is a challenge to find a 
proposal function that is able to use the data efficiently in order to obtain 
satisfactory convergence. This is discussed in the paper and four different 
examples are given with both parametric and nonparametric alternatives. 
We call this algorithm the adaptive independent Metropolis-Hastings al- 
gorithm. This is a special case of the adaptive chain described in Holden 
(1998a). Surprisingly, the limiting density is invariant with an independent 
proposal function and hence this gives much better convergence properties 
than in the general case. The algorithm is particularly designed for examples 
where it is expensive to calculate the limiting density it. In such examples 
it is most likely cost effective to use a lot of data in an adaptive scheme. 

Gasemyr (2003) describes another adaptive algorithm based on the inde- 
pendent Metropolis-Hastings algorithm. The convergence rate of both that 
algorithm and our algorithm depends on the ratio vr/g,, where qi denotes 
the proposal distribution in iteration i. Both algorithms do also give exact 
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samples in a finite number of iterations. The adaption schemes seem dif- 
ferent, as the algorithm in Gasemyr (2003) adapts on previously accepted 
states, whereas our algorithm adapts on proposed states. However, this dif- 
ference is not as essential since the algorithm presented in Gasemyr (2003) 
may easily use proposed states instead. From all the proposed states, it is 
possible to generate a new independent chain with the same properties as 
the accepted chain. The main difference is that the algorithm described in 
Gasemyr (2003) requires that the supremum of the ratio f /qi is known, 
where / = ctt and c is an unknown constant. This supremum is used in the 
algorithm, and although the algorithm will converge without it, the con- 
vergence will be weaker. It adapts using groups of previous states, and the 
adaption stops after a finite number of iterations. The algorithm presented 
in this paper only requires that the supremum of n/qi is finite, it may use all 
previous states in the adaption and the adaption does not stop. It may also 
include iterations with proposal functions that depend on the present posi- 
tion, but the information from these iterations cannot be used for adaption. 
Andrieu and Moulines (2003) also discuss independent Metropolis-Hastings 
and our Theorem 1 is of interest for the research area they focus on. 

The independent Metropolis-Hastings algorithm is an efficient sampling 
algorithm only if qi is reasonably close to ir. However, it allows great freedom 
of adaption, as shown both here and in Gasemyr (2003). Adaption should 
also make the proposals resemble it, and hence the adaptive version will be 
an efficient sampler. We will prove a bound on the convergence that depends 
on the supremum of n/qi. An attractive property of independent proposals 
is their ability to make large jumps, and if this can be done while keeping the 
acceptance rate high, the autocorrelation of the chain will decrease rapidly. 

2. Definition of the adaptive independent Metropolis Hastings algorithm. 

The goal for the algorithm is to sample from a distribution ir, which is known 
except for a normalizing constant. The algorithm resembles the traditional 
Metropolis-Hastings algorithm (MH). A chain x is generated by drawing 
proposals Z{ from a proposal distribution gj(zj|rEj_i,y' t_1 ) and either accept- 
ing them, setting Xi = Zi, or rejecting them, setting Xi = Xi-±. The proposal 
history vector y % is defined as follows: If the proposal is independent of the 
current state, the history vector is extended by including if Z{ was rejected, 
and Xi-i otherwise. On the other hand, if the proposal is dependent on the 
current state, the history vector is kept unmodified, that is, y % = y l ~ l . The 
difference between traditional MH and adaptive independent MH is only 
that the proposal function qi may depend on the history vector. In adaptive 
independent MH qi may depend on all states where the function / = cir 
has been evaluated, except the current state of the chain, and these values 
can be used to make qi a better approximation of tt. Also when doing local 
steps conditioned on Xj, all information in y l can be used. The limitation is 
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that information gained from doing the local steps cannot be used, so these 
iterations do not improve the proposal. 
The full algorithm is given as: 

1. Set y° = 0. 

2. Generate an initial state xq € from the density pq. 

3. For i = 1, . . . , n: 

(a) Generate a state z$ from the density qi{zi\xi-\, y 1 ^ 1 )- 

(b) Calculate y^) = min{l, ^^f^ }■ 

(c) Set 




Zj, with probability a{(zi,Xi-i,y i 1 ), 

with probability 1 — cti(z{, y l ~ 1 )- 



(d) Set y* = y l 1 if g, depended on a?j_i. Otherwise, 

, , .,, ( Zi, if & was rejected , 

y l = y l 1 appended with ^ ' J ^ ,' 

[Xi-i, it zi was accepted. 

A classical random-walk MH algorithm where only states close to the 
present state in the chain are proposed, may stay close to a local optimum 
for a large number of iterations such that the user believes that the chain 
has converged. This may also be a problem with the presented algorithm, 
but if the proposal function is used properly, the probability for staying close 
to a local optimum is less with this algorithm than with classical MH since 
more information may be included in the proposal. However, if the proposal 
function adapts too fast, the presented algorithm may be even worse than 
classical MH. 

The presented algorithm gives a very large flexibility in choosing the pro- 
posal function Creating a good proposal function may be difficult. The 
choice depends on the problem we want to solve, the function 7r, the time 
used for an evaluation of ir, how to let qi approximate ir, and the CPU 
and software resources available. It may also be a challenge to use all the 
data y L effectively, in particular in higher dimensions. Theorem 2 gives a 
bound on the convergence based on how well qi approximates ir, similar 
to the convergence bounds for the Metropolis-Hastings algorithm given in 
Holden (1998b). High convergence rate also implies that the correlation be- 
tween state Xi and Xi + k decreases quickly as k increases. This may be as 
important as convergence if the chain is used to compute averages such as 

1 N 



for some function F. 
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Finding a suitable representation of qi that is able to use all the data y l 
effectively may be a challenge. For a parametric qi this is done by estimating 
the parameters from the knowledge about tt in the history states. This is 
illustrated in Example 4. However, ir may be too complicated for a simple 
parametric description. In these cases, nonparametric approaches can be 
used, as is done in three of the examples. It is always important that the 
proposal function does not adapt too fast and has heavy tails. This becomes 
clear when we see the convergence properties and the examples. Another 
problem is that the history vector grows large, and using all the information 
there for an update may become time consuming. Again, the method used in 
the examples shows how this information may be reduced. In cases where the 
evaluation of / = ctt is expensive, the overhead in computing the proposal 
even with a full history vector may be insignificant. Kriging or other spatial 
statistical methods are possible, but these alternatives may be complicated 
if the number data points become large. In a large class of problems [see, 
e.g., Park and Jeon (2002) and Kleijnen and Sargent (2000)], 

m 

j=i 

where (gi, #2> • • • , 9m) is the vector- valued response of a simulation model, Uj 
are weights and dj are the data. This is an inverse problem where it is nec- 
essary to find the parameter values x with uncertainty that give an observed 
response. For these problems we may use the evaluations of 9j{zi) to find an 
approximation to the functions gj(x) parametrically or nonparametrically. 
Example 4 illustrates this approach. 

3. Convergence. Let fi 1 = O C W 1 be a Borel- measurable state space 
or, alternatively, let O be a discrete state space, and = Q x $7*, and 
x l = [x\, . . . ,X{) G Q l . Let n(x l ) be the product measure on Q l of a a finite 
measure /j,(x). Further, let the density tt and the proposal functions qi be 
integrable with respect to \x including point mass distribution. In the nota- 
tion we neglect that y 1 may have dimension less than i in order to simplify 
the notation. This is the case if some of the proposal functions qi use local 
proposals. 

The convergence rate depends on the constant aj(y J_1 ) in the strong Doe- 
blin condition: Let aj(y* _1 ) G [0, 1] satisfy 

(1) qiizlx,^' 1 ) > ai$- x )ir(z) for all {z,x, f' 1 ) G ft m and all i > 1. 

The Doeblin condition essentially requires that all the proposal distribu- 
tion has uniformly heavier tails than the target distribution. This con- 
dition is always satisfied for a.;(y i_1 ) = 0. Theorem 2 below is valid also 
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in this case, but we only prove convergence if the expected value satis- 
fies -E(n^i(l — = 0. However, it may be useful to allow that 
= for some values of i. It is possible to obtain dj(y ) > by 
making the tails in the proposal function sufficiently heavy. This will usually 
be satisfied in a Bayesian approach using a prior with heavy tails. Alterna- 
tively, we may replace q%{z\x, y l ~ l ) by (1 — e)qi(z\x, y % ~ 1 ) + eg(z) where g{z) 
is a density with extremely heavy tails and e is small. The assumptions in 
the theorem are satisfied and we only "waste" at most e of the proposals. 
However, a small e may give slow convergence in the tails. The Doeblin con- 
dition is natural for an independent sampler. If it is not satisfied for some 
states in fi, the algorithm will tend to undersample these states. This is not 
crucial if the probability mass of these states is low and further inference 
does not depend on tail behavior. In one of our examples the stationary dis- 
tribution has heavier tails than the proposal distribution, but convergence 
is still achieved with the measure we use. 

Let Pi(xi) be the distribution for Xi after i iterations and pi(xi\y l ) the 
conditional distribution for Xi given the history vector. The following theo- 
rem is crucial in the understanding of the algorithm, and is the basis for the 
convergence result. 

Theorem 1. The limiting density ir conditioned on the history is in- 
variant for the adaptive independent Metropolis-Hastings algorithm, that is, 
Pi(xi\y l ) = ir(xi) implies p i+ i(x i+1 \y l+1 ) = ir(x i+1 ). 

It is possible to combine this theorem with Theorem 6 in Roberts and Rosenthal 
(2005) to get the following result: If we assume that the proposal function 
for all histories is uniform ergodic and the adaption diminishes, then the 
algorithm converges in the total variance (TV) norm 



JQ 

In this paper we will instead assume the strong Doeblin condition and 
obtain geometric convergence. 

Theorem 2. Assume the adaptive independent Metropolis-Hastings al- 
gorithm satisfies (1). Then 




(2) 



bi-7r|| TV <2 / m)\{^-a0' l ))d^ 1 ) 



(3) 
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If a j{& X ) > a j f or a M 3 an d y J 1 £ 1 , then 



(4) 



||K-vr|| T v<2n( 1 -^)- 



The algorithm converges if this product goes to zero when i — > oo . If aj > a > 
infinitely often, the algorithm samples from the target distribution within 
a finite number of samples with a probability arbitrarily close to 1. 

The proofs are given in the Appendix. Theorem 2 says that convergence 
is geometric as long as a strong Doeblin condition is satisfied for all possible 
histories with aj(yi~ l ) > a > for all j and £ fi J . In each iteration 
the chain jumps to the limiting density ir with probability aj{y^~ l ). Then the 
chain remains in this density according to Theorem 1 since tt is invariant for 
the adaptive independent Metropolis-Hastings algorithm. The probability 
for not sampling from the limiting density after j iterations is .E7(]Tj=i(l — 



If the adaption succeeds in generating better proposal distributions, a^y 1 " 1 ) 
will increase as i increases, and the convergence will be accelerating. This 
also means that the number of iterations needed to generate a set of inde- 
pendent samples decreases. 

4. Examples. This section gives four examples. All examples are schematic 
examples where we compare different Metropolis-Hastings algorithms: in- 
dependent sampler, random-walk proposing small jumps and adaptive inde- 
pendent sampler where the proposal function is higher close to the modes 
that are identified so far. The first example is a quantitative comparison 
showing that the adaptive independent sampler converges faster, identifies 
modes better and jumps more often between different modes than the other 
algorithms. In Examples 2 and 3 the same adaptive independent algorithm is 
used showing how flexibly the algorithm may be used in different cases. The 
three first examples are in one or two dimensions. We do not believe that the 
dimension is critical. What is important, though, is how sharp the modes are 
and the number of modes. But in high dimension one may prefer to change 
only a few variables in each iteration if this reduces the number of calcula- 
tions per iteration and not because this is an efficient Metropolis-Hastings 
algorithm per iteration. The final example illustrates how to combine the 
proposed algorithm with an external simulation model that is assumed to 
be very demanding to evaluate. This is a typical problem in a large number 
of applications. 

Example 1. Let tt be the function 



a j{y j ~ x )))- 



7r(x) = 4cmin{(x + |) a , (| 



x) a } + cmin{(x + |) a , (| - x) a } 
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for constants c and a and for < x < 1. See Figure 1. We will compare 
three different proposal functions denned in < x < 1. Let the independence 
sampler have proposal function q 1 (x) = 1. The random-walk sampler is given 
the proposal function qf(x) = 1/L for — L/2 < x < scj-i + L/2, and 
<jf (a;) = otherwise, where Xi—\ is the present position of the chain and L is 
the maximum step length. In the case Xi-% is close to or 1, it is necessary 
to reduce L in this iteration in order for the proposal function to be a proper 
density. This is very unlikely to happen except in the first few iterations. 
Before we can describe the proposal function for the adaptive independent 
sampler we need to define the two variables z\ £ y % ~ 1 such that f(zj) > f(z) 
for all z € y l ~ l and zj,z < 0.5 and that f(zf) > f(z) for all z G y l ~ l and 
zf,z > 0.5. Hence, the zf's are our best guess on the two modes after i 
iterations of the Markov chain. The adaptive independent sampler is then 
given the proposal function 

q 3^. = [ 1 - 2p + p/L, if 4 -L/2<x< z\ + L/2 for j = 1, 2, 
1 1 — 2p, otherwise, 

for two constants p and L. 2p is the probability for a proposing a local jump 
and L is the maximum length of a local jump. Also for this sampler it may 
be necessary to reduce L in some iteration in order for the proposal function 
to be a proper density. 

In the numerical calculations we set a = 2000; see Figure 1. Then the 
modes are so sharp that 0.996 of the probability mass of ir(x) is located in 
two small intervals close to the modes with total length 0.01. Even though 
this example is schematic we believe it is quite representative for many 
MCMC problems. The typical MCMC is a random walk making small steps 
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Fig. 1. The function n(x) in Example 1 with a — 30. 
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in each iteration. Often it is not critical whether this is made in one dimen- 
sion at a time or there is only one dimension. We set the maximum local 
step length L = 0.02 in order to get an acceptance rate of about 0.25. If the 
step length is larger, the algorithm finds the mode faster but the acceptance 
rate decreases. The random-walk algorithm is very slow to move between 
the modes. This is illustrated in the lower left part of Figure 2 showing the 
minimum of the iteration number and the number of iterations since the 
chain was closer to the other mode. For random walk this is almost the 
same as the iteration number. Then the algorithm uses a very long time to 
find out that the probability mass close to the mode at x = 1/3 is larger. 
This slows down the convergence after about 100 iterations. The adaptive 
chain finds the modes faster than random walk. This is seen both in the 
convergence and in the ratio 7r(l/3)/pi(l/3) in the lower right part of the 
figure. Another important property is that the adaptive chain jumps easily 
between the modes. If we increase L, then the adaptive chain finds the mode 
faster, but does not jump as easily between the modes. We set p = 0.4 in 
order to jump often between the possible modes. The independence sampler 
uses many iterations to find the modes and the acceptance rate is very small. 
If we increase a making the modes sharper, then the difference between the 
different algorithms becomes even larger. 

Example 2. The adaptive chain is tested on a multimode example from 
Tjelmeland and Hegstad (2001). Let O = R n and f(x) =J2j=i^j i P^,s 3 {x) 
where V 7 ^,^ * s the normal density in R 2 with expectation fij and correla- 
tion matrix T,j. The constants Uj > satisfy ^j^j = 1- A natural proposal 
function to use in an adaptive Metropolis-Hastings algorithm, both here 
and generally, is a mixture of normal distributions: 

(5) qfizlx,^- 1 ) OC7&^,Ao(3) + ]C r V^<j.Af( Z )- 

3=1 

The expectation is positioned central in the distribution. The correspond- 
ing variance Ao is large. The other expectations i/jj for j > are estimates 
of where the undersampling of ir is largest. The weights Tj j determine how 
often each distribution in the mixture is used for the proposal. 

During the simulation we update a list of possible modes {vi,jYjL\ C y l ~ l - 
The list is empty when we start and is updated by the algorithm below based 
on the function R{y) = f(y)/(p UQ) A (y) with invariant R(v\) > R{i>2) > • • ■ 
There is a maximum length of the list, that is, rij < M. The most recent 
state in the history y is considered included in the list by the algorithm: 
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Fig. 2. The figure compares the three Metropolis-Hastings algorithms: independence sam- 
pler (line), random walk (dotted) and adaptive independent sampler (dashed). We see that 
the adaptive chain converges faster, jumps faster between the modes and obtains the cor- 
rect density in the mode faster than the two other algorithms. Random walk has a higher 
acceptance rate, but with small steps each time. The simulation is based on 10,000 samples. 



1. If R(y) > R(u M ) or m < M then: 
(a) For j = 1,.. . , minjnj, M} 

i. If R(y) > R(vj) 

A. <Include y in list before Vj> 

B. For k = j + +2,..., mm{ni,M} 

If \\y — i/fcH < ei/2 then 

< Remove v\. from list and exit loop> 

ii. Else if \y — za-|| < E\ exit loop. 

The index i is omitted in the algorithm and || • || denotes the Euclidean norm. 
Only m-i < M of the Vj are used in the proposal function. More states are 
kept in the list since a new mode may remove several old ones. Let rrii = 
min{Mo, n,i\ and tq = 0.5. The weights Tjj for j = 1, 2, . . . , are defined as 
Tjj = l/(5Mo) + cf(ui t j) where c is defined such that J2j T i,j = 1- 

In addition, it makes sense to actively decrease the proposal probability in 
areas where previous proposals have shown that / is small. This is especially 
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Fig. 3. Location of modes in the mixed normal distribution used as target distribution. 

attractive when evaluation of / is expensive. A list of states £jj are found, 
similarly as z/jj are found, maximizing S(y) = ^ (y)/ f(y) for an exponent 
p > 1. The constants N, Nq and £2 are defined as M, Mq and £\, respectively. 
We can now actively reduce the sampling likelihood for the first term in (5) 
by introducing a new factor pi(z,y l ~ 1 ), giving the proposal 



(6) qf{z\x,y l x ) cxT pi(z,y l v 



3=1 



The function pi{z,y l ~ l ) is equal to a constant q > 1 except if the function 
S(y) has been evaluated for y close to z and had a small value. In the latter 
case pi(z,y l ~ 1 ) is small. Define the criteria Ai >t so that \\z — ^|| < £2 and 
11-2 — fell — £ 2 for all k < t < Nq. The following definition for p is used: 



Pi(z,tf X ) 



max{<5,/(^ )t )/^ 0) A (^,t)}, 
^7 . 



Ai :t satisfied, 

Ai t not satisfied for any t. 



The constant Cj is adjusted during the simulation in order to ensure that 
approximately 50% of the proposals come from the first term of the proposal 
function (6). Careful considerations on the smoothness of / should be made 
when choosing 62, since decreasing the proposal probability in an area also 
may worsen inf {qi(z\x, y t ~ 1 )/ir(z)}, and hence worsen convergence. 

The test example is the same as in Tjelmeland and Hegstad (2001) with 
k = 13 and modes pj located as shown in Figure 3. Each mode i has the same 
weight uJi = 1/13 and covariance E, = diag(0.01 2 , 0.01 2 ). The variance is so 
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21X0 4000 6000 8000 10000 2000 4000 6000 6000 10000 

Numb&i or rt&ialions number at iterations 



q Li convergence 




Number or nerali<j*i5 Number oi iterations 

Fig. 4. The chains: independent sampler (q 1 , upper left), random walk (q 2 , upper right), 
adaptive chain (q , lower left) jump between the 13 modes. The distance between a line and 
a dot indicates the distance between the sample and the mode. The modes closest to origin 
have smallest y value. The adaptive chain using the list converges fastest (lower right), 
closely followed by the other adaptive chain, then the independent sampler and finally the 
random-walk sampler. The last one is very slow since the probability of sampling far from 
origin is very small. 



small that a random walk between the outer modes is very unlikely, whereas 
random walks between the five center nodes are plausible. The constants 
in the adaptive chain are Mq = 20, M = 25, and E\ = 0.05. The variances 
are set equal to A = diag(l,l) and Aj = diag((0.03) 2 , (0.03) 2 ) for j > 0. 
Two adaptive chains are evaluated: one with p = 1 (q 3 ) and another (q 4 ) 
where p depends on the £. r list using Nq = N = 1000, £2 = 0.05, 5 = 0.1 and 
p = 1.3. In addition two Metropolis-Hastings algorithms are implemented as 
a comparison with an independent proposer q l (y) = ipv ,A (y) an d a random- 
walk proposer q 2 {y\x^i) = ^• l _ 1 ,(o.3) 2 A (y)> respectively. 

Figure 4 shows the results from the experiments. The two adaptive chains 
are quite similar with acceptance rates equal to 0.07 and 0.09, respectively. 
The Markov chains have acceptance rates 0.001 and 0.004, respectively. The 
random walk jumps more often, but less often between modes. The adap- 
tive chains resample often in modes that are identified earlier in the chain. 
The first time a mode is sampled, the chain leaves this mode with a small 
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probability. Note also that the independent Markov chain does not sample 
as close to the mode value as the adaptive chain. This is the main rea- 
son for the difference in convergence, and is due to more sampling in the 
modes by the adaptive chain. The convergence is evaluated by calculating 
J2j \ r j — 1/52| where rj is the ratio of chains within 52 regions which have 
the same probability in the target distribution /. There are four regions 
around each mode and each of these regions is described by the distance to 
the mode. The convergence measure using 20,000 chains will not be below 
0.05 due to noise. The adaptive chain converges faster than the indepen- 
dent sampler. More important is the difference in mixing. In order to get 
1000 samples with low correlation, the adaptive chain needs approximately 
20,000 iterations (burn in 3000 and then sample every 10 iterations) and 
the independent Markov chain needs in the order of 1,000,000 (sample every 
1000 iterations). There are several possibilities for further improvement. The 
variances Aj and the constants and 5 may be adapted based on y 1 " 1 . It 
is possible to run a local optimizer starting with some of the states in the 
v list and include the endpoints of these optimizations into the list. The 
states in the £ list could be sorted into regions in order to avoid checking 
the complete list. Which variant of the algorithm is the best depends on the 
properties of /, in particular the cost of evaluating the function, in addi- 
tion to the CPU resources available. The increased CPU resources required 
by changing from a traditional Metropolis-Hastings algorithm to an adap- 
tive independent Metropolis-Hastings algorithm with q 3 , are not large, but 
involves some programming. To use the proposal function q 4 increases the 
CPU resources considerably, and should only be used if the evaluation of / 
is very expensive. 

Note that even though the target distribution here was on the same form 
as the proposal, this method of adapting a proposal distribution is of general 
value. It can be viewed as a nonparametric approximation of the target 
distribution, bearing some resemblance to kernel methods. 

Example 3. Let / be the Cauchy distribution, f(x) = l/(l + x 2 )7r. This 
distribution has heavier tails than the normal distribution and is known 
to give problems in the simulation; see Roberts and Stramer (2002). The 
same simulation algorithm as in Example 2 is used in this example. Only 
the adaptive chain q 3 and the independent Metropolis-Hastings algorithm 
q l (y) = ipo,i(y) are tested. The constants in the adaptive chain are Mo = 70, 
M = 80 and £\ = 0.05. The variances are set equal to Ao = 1 and Aj = (0.5) 2 
for j > 0. The modes v%i in the proposal function will be in the tails of 
the distribution of / with distance approximately E\ between neighboring 
modes. Figure 5 shows one chain for the two methods. Figure 6 shows the 
convergence estimated by dividing the state space into 20 equally likely 
regions for \x\ and uses the same norm as in the previous example. Ten 
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Fig. 5. The Markov chain (q , left) samples the tails too seldom. The adaptive chain 
(q 3, , right) increases the sampling of the tails as the modes v.,. identify these areas. 
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Fig. 6. The adaptive chain converges much faster than the Markov chain. The adaptive 
chain reaches the Monte Carlo noise level for the convergence measure. 



thousand chains give a Monte Carlo noise level of approximately 0.07. The 
acceptance rates are 0.7 and 0.8, respectively, slightly higher for the adaptive 
chain. The adaptive chain reduces the problem with heavy tails by position- 
ing modes in this area. This shows that the adaptive chain may be useful 
also when there are large differences between the limiting function and the 
proposal function. Both the adaptive and the independent chain converge 
even though the Doeblin condition is not satisfied. The independent sam- 
pler converges very slowly since the probability for sampling a state far from 
origin is too small. When such a state is sampled, the probability for leaving 
the state is very small. 

Example 4. In many applications in, for example, climate or petroleum, 
there are large simulation models. If each simulation takes several hours 
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of CPU, it is necessary to carefully use all available information including 
previous runs before we start new simulations. There is often uncertainty 
in some input parameters and we want to find distributions for the input 
parameters that satisfy data that is output from the simulation model. It 
is outside the scope of this paper to describe such a large simulation model 
and we will illustrate this problem with a more schematic example. 
We illustrate the simulation model with the function 

5 

/(x) = 3sin(xi7r) — x\/2 + sin(xj7r/2) 

i=2 

assuming < xi < 1 for i = 1,2,3,4,5. We use a Bayesian approach with a 
noninformative prior and the likelihood 

/(x|d)=exp(-(/(x)-d) 2 /^ 2 ). 

Then we want to simulate x proportional to l(x\d), but evaluate the function 
/(x) as few times as possible. The function /(x) may be approximated by 
the linear regression 

5 

/(x) = a + aiXi + bx\ 
i=l 

where the constants aj for i = 0,1,2,3,4,5 and b are evaluated from the 
evaluations of the function /(x). Notice that most evaluations will be in the 
area where the likelihood is highest, that is, the linear regression will be 
best in the area of largest interest. In the adaptive independent sampler the 
proposal functions consist of two steps. We first propose uniformly x £ (0, l) 5 
and then accept the proposal proportional to the function 

/(x|rf)=exp(-(/(x)-d) 2 /(5 ( 7 2 )). 

Notice that we have included the factor 5 in the exponent in order to propose 
from a slightly larger area than the likelihood. This algorithm is compared 
with the independent sampler proposing uniformly from x £ (0, l) 5 and the 
random-walk sampler where each component of x is proposed changed uni- 
formly in an interval with length L = 0.1 relative to the current state of the 
chain. The length of the interval is reduced close to the boundary of the 
domain. 

We set d = 2.5 and a 2 = 0.005. Then the likelihood is largest at two four- 
dimensional planes intersecting the domain close to two of the corners of the 
domain of /. The lower right figure in Figure 7 shows the likelihood along a 
line intersecting these two planes. The simulation shows that the adaptive 
sampler converges fastest and jumps easily between the two modes. The 
independent sampler jumps seldom and the random-walk sampler is not 
able to move between the modes and hence converge. 
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Fig. 7. The figure compares the three Metropolis-Hastings algorithms: independence sam- 
pler (line), random walk (dotted) and adaptive independent sampler (dashed). We see that 
the adaptive chain converges faster and jumps faster between the modes. Random walk 
has a higher acceptance rate, but with small steps each time. The simulation is based on 
50,000 samples. The lower right figure shows the likelihood l(xi, 0.1, 0.1, 0.1, 0.1, 0.1|d) and 
the proposal function based on 10 and 50 evaluations of the simulation model along a line 
crossing the two planes with the modes. 



The traditional statistical approach for this problem would be to first 
perform some simulations of the large model estimating the linear regression 
parameters and then use the linear regression in the proposal function in a 
Metropolis-Hastings sampler. This implies to use no information in the data 
collection step and not the new information gained during the Metropolis- 
Hastings sampling. The adaptive sampler always uses all the information 
and all the time focuses on the most interesting area based on the available 
information. In a practical problem we will often not know the number of 
modes and whether we have found all the modes or not. The industrial 
practice in many such problems is often to neglect the uncertainty in the 
parameters and estimate the parameters using a standard gradient search 
optimization algorithm. 

5. Concluding remarks. The algorithm presented here allows fairly gen- 
eral adaption, based only on the assumption that a proposal distribution 
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satisfying a strong Doeblin condition can be found. Convergence is geomet- 
ric, with a rate that increases as the proposal distribution gets closer to the 
target distribution. The chain is also invariant for the limiting distribution in 
contrast to the more general adaptive chain. All previously proposed states, 
except the current state, can be used to generate the new proposal, allowing 
almost all previously gained information about the target distribution to be 
used. The algorithm is tested in four schematic examples, three with several 
modes and one with a heavy tail distribution, situations that are generally 
difficult for MCMC methods. Three examples are nonparametric and one is 
parametric. In all cases, the adaptive algorithm performed better than the 
standard MCMC alternatives we used for comparison. 

The adaptive independent Metropolis-Hastings algorithm is a special case 
of the adaptive chain described in Holden (1998a), where convergence of the 
adaptive chain was proved by assuming that the detailed balance condition 
was satisfied for each possible history. This did in general not give invariance, 
which is a property of the adaptive independent Metropolis-Hastings. Con- 
vergence is here proved without assuming the detailed balance condition. 
Instead, we use rejection sampling and a restricted history which assures 
stationarity when the target distribution is reached. 

The proposed algorithm may be considered as a generalization of the 
commonly used Metropolis-Hastings algorithm. If the proposal function in 
an iteration does not depend on the present state, then the present or the 
proposed state may be used for improving later proposal functions. Which 
state that may be used depends on whether the proposed state is accepted 
or not. This simple property contributes to the understanding of the com- 
monly used Metropolis-Hastings algorithm. The paper also shows the close 
relationship between rejection sampling, MCMC and adaptive chains. 



Proof of Theorem 1. If the proposal function qi + \ depends on the 
present state Xi, then the history is not extended and the theorem follows 
from standard Metropolis-Hastings theory. We will therefore focus on the 
case where qi + \ is independent of the present state Xj, and we will use the 
notation q i+1 (z i+ i\y % ) . 

Assume Pi(xi\y % ) = ir(xi). Let w be the state added to the history in step 
i + l, that is, y l+l = (y l ,w). Then w £ Q, is either Xi or Zi + \ depending on 
whether the proposal Zi + \ is accepted or rejected in the (i + l)th iteration. 
The joint distribution of is 
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+ 7r(x i+1 )qi + i(w\y % )(l - a i+1 (w,x i+1 ,f))) 

= Pi{y % )(n{xi + i)q i+ i{w\tf) 

+ ir(w)q i+ i (x i+ i\f)a i+ i (x i+1 ,w,y % ) 
-7r(x i+1 )qi + i(w\y l )a i+1 (w,Xi + i,y' 1 )) 

= Tt{x i+ i)pi(y % )q i+l (w\y l ). 

This shows that the chain never leaves the stationary distribution once it is 
reached and we have pi + i(xi + i\y l+1 ) =7r(sci+i). 

Except for minor changes in the notation, the calculation above may also 
be used to prove the theorem when qi + \ depends on the present state x%. 
Then we must integrate over w to get pi + i(xi + i\y t ) = ir(xi+x). This is why 
the history cannot be extended in these iterations, and we set y l+1 = y l . The 
critical point in the entire paper is the above calculation. Notice that it is 
essential that the proposal function does not depend on the present state, 
but may depend on the entire history y l . □ 

Proof of Theorem 2. Given that the limiting density is invariant 
for the algorithm conditioned on the history, the next step is to prove that 
a chain can reach the stationary distribution. This is done by observing 
that each iteration with an independent proposal also can be viewed as an 
iteration of a rejection sampler. Let it, be distributed uniformly between 
and 1, and the proposal Zi be accepted if U{ < on. Let Ai be the condition that 
Uiqi(zi\y t ~ 1 )/ir(zi) <ai(y 4_1 ). Then the distribution of Zi given by 
and A4 is proportional to 

Pro^Aily 1 ' 1 ,x^, z^q^Zilxi^,^ 1 ' 1 ) 

= Piob(u i <a i (y i ~ 1 ) ^^K- n W^-i, jf" 1 ) 
V qi(zi\xi-i,y l - l )J 

= a i (y i ~ 1 )Tr(z i ). 

This means that if Ai is satisfied, Zi is a sample from tt. Furthermore, if 
Ai is satisfied, m < «j, and hence Zi is always accepted. If qi does not de- 
pend on Xi-i, then Xi-i is appended to the history, and we conclude that 
Pi(x\y l ,Ai) =tt(x). If qi depends on we must integrate over the dis- 

tribution of Xi—i given y l ,Ai to obtain the same conclusion. Finally, Ai 
is satisfied with probability aj(y 4_1 ). This implies that the probability for 
jumping to the limiting distribution in iteration i is ai(y J_1 ) independent of 
the distribution of Xj. From Theorem 1 we see that when we reach the lim- 
iting density we remain in the limiting density. Hence, this is an absorbing 
state with a probability larger than a > to jump to this in each iteration. 
Then the probability to be in the limiting density within a finite number of 
samples may be made arbitrarily close to 1. 
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Let 



The probability for Xi being from the limiting distribution is then described 
by the following chain: Let Iq = and for i > 



t+i 



0, with probability 1 — ai + i{y % ) if Ii = 0, 

1, otherwise. 



Clearly 



Probft = oif- 1 ) = n(i - = W" 1 )- 

This implies that 
(7) PiNf!*) = vr(xi)(l - + v i (x i \y%(y i ) 

where Uj is a distribution. This gives the following bound on the error: 



\\Pi ~ tt||tv 



Pi(xi\f)Pi{y l ) dfi(f) -vr(xi) 



d/j,(xi 



xpity^dniy 1 



dp,(xi 



,{vi(xi\y l ) - Tv{xi))pi{y l )bi{y l l )d^(y l ) 



d/j,(xi 



<2 / PMMy'ld^m, 

J a* 

proving (3). 

If aj + i{y 3 ) > aj+i for all j and y- 7 6 JP, then (4) follows trivially. Note 
that the probability for jumping to the limiting distribution in iteration i is 
a i{y % ~ 1 ) independent of the distribution of X\. □ 
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